-
Notifications
You must be signed in to change notification settings - Fork 474
Mtmd implementation #1261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Mtmd implementation #1261
Conversation
fcce175
to
9931d0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.
- Migration from LLaVA-specific classes to generic MTMD wrapper classes
- Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
- Updated executors to use MTMD tokenization instead of direct image embedding evaluation
- Comprehensive test coverage for the new MTMD functionality
Reviewed Changes
Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.
Show a summary per file
File | Description |
---|---|
SafeMtmdWeights.cs | New wrapper class for MTMD multimodal weights replacing LLavaWeights |
NativeApi.Mtmd.cs | Native P/Invoke surface for MTMD helper functions |
SafeMtmdModelHandle.cs | Native handle management for MTMD models with tokenization and evaluation |
SafeMtmdInputChunks.cs | Managed wrapper for native chunk collections returned by tokenizer |
SafeMtmdInputChunk.cs | Individual chunk wrapper with metadata access and token span views |
SafeMtmdEmbed.cs | Media embedding wrapper supporting images, audio, and raw data buffers |
LLamaInteractExecutor.cs | Updated interactive executor to use MTMD tokenization workflow |
LLamaInstructExecutor.cs | Updated instruct executor with MTMD preprocessing logic |
BatchedExecutor.cs | Added MTMD batch evaluation support for batched inference |
Conversation.cs | Extended conversation class with multimodal prompting and media queueing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
if (inferenceParams.MaxTokens == 0) | ||
{ | ||
_embeds.Clear(); | ||
args.WaitForInput = true; | ||
args.ReturnValue = false; | ||
return; | ||
} |
Copilot
AI
Sep 28, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This MaxTokens == 0 check and its logic is duplicated in InstructExecutor. Consider extracting this into a shared method in the base class.
Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
: IDisposable | ||
{ | ||
private int _nextSequenceId; | ||
private readonly List<IBatch> _batchQueue = []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the changes from #1262 have been undone here? Probably an accident that requires rebasing?
} | ||
} | ||
|
||
private class MtmdChunkBatch : IBatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the batched executor worked with llava previously was llava embedded the image, and then you could prompt with the raw embeddings (EmbeddingBatch
). That was nice, since the batched executor was totally independent from llava. Is that no longer possible with MTMD?
/// </summary> | ||
private bool _forked; | ||
private readonly List<SafeMtmdEmbed> _mtmdEmbeds = new(); | ||
private int? _mtmdLogitsIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only ever seems to be set to -1 or null?
_mtmdLogitsIndex = null; | ||
} | ||
|
||
public void QueueMedia(SafeMtmdEmbed embed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add an overload of Prompt that takes a Span<SafeMtmdEmbed>
instead of enqueuing them? This removes some state and makes the ownership of the embed objects clearer (e.g. at the moment you could call QueueMedia
, then dispose the media, which would trigger an error on the next call to Prompt
when you tried to use that disposed object).
return sampler.Sample(conversation.Executor.Context.NativeHandle, conversation.GetSampleIndex(offset)); | ||
var ctx = conversation.Executor.Context.NativeHandle; | ||
if (conversation.MtmdLogitsIndex == -1) | ||
return sampler.Sample(ctx, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very odd to me - if you had multiple active conversations and prompted them all with an image then next time you would be sampling from index=-1
for all of the conversations! Maybe that's right, it just caught my eye.
}; | ||
} | ||
|
||
private static string? PtrToString(IntPtr ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be extracted out into a helper, it's duplicated elsewhere at the moment
/// Convert a UTF-8 encoded native string pointer into a managed <see cref="string"/>. | ||
/// Returns <c>null</c> when the pointer is zero. | ||
/// </summary> | ||
public static string? PtrToStringUtf8(IntPtr ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment about duplicated code
/// Managed wrapper around <c>mtmd_bitmap*</c> resources. Instances own the native pointer | ||
/// and ensure proper cleanup when disposed. | ||
/// </summary> | ||
public sealed class SafeMtmdEmbed : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this owns a native resource it should probably implement SafeHandle
/// underlying native pointer (when created via <see cref="Copy"/>) or act as non-owning views | ||
/// produced by the tokenizer. | ||
/// </summary> | ||
public sealed class SafeMtmdInputChunk : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this owns a native resource it should probably implement SafeHandle
/// <summary> | ||
/// Managed lifetime wrapper around a native <c>mtmd_input_chunks</c> collection returned by the tokenizer. | ||
/// </summary> | ||
public sealed class SafeMtmdInputChunks : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this owns a native resource it should probably implement SafeHandle
/// <summary> | ||
/// Lightweight wrapper around the MTMD native context and its helpers. | ||
/// </summary> | ||
public sealed class SafeMtmdWeights : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to MtmdWeights for consistency with LLamaWeights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍
Prototype implementation: